The Distributed Network Processor: a novel off-chip and on-chip
interconnection network architecture

Andrea Biagioni, Francesca Lo Cicero, Alessandro Lonardo, Pier Stanislao Paolucci, Mersia Perra, 
Davide Rossetti, Carlo Sidore, Francesco Simula, Laura Tosoratto and Piero Vicini 












INFN Roma c/o Physics Dep. "Sapienza" Univ. di Roma, p.le Aldo Moro, 2 - 00185 Roma, Italy 

Email: [firstname.lastname] @romal 

Abstract: One of the most demanding challenges for the designers
of parallel computing architectures is to deliver an efficient
network infrastructure providing low-latency, high-bandwidth
communications while preserving scalability. Besides off-chip
communications between processors, recent multi-tile (i.e.
multi-core) architectures face the challenge of providing an
efficient on-chip interconnection network between processor tiles.
In this paper, we present a configurable and scalable architecture,
based on our Distributed Network Processor (DNP) IP Library,
targeting systems ranging from single MPSoCs to massive HPC
platforms.

The DNP provides inter-tile services for both on-chip and off-chip
communications with a uniform RDMA-style API, over a
multi-dimensional direct network (see [1] for a definition of
direct networks) with a (possibly) hybrid topology. It is designed
as a parametric Intellectual Property Library easily customizable
to specific needs. The currently available blocks implement
wormhole, deadlock-free packet-based communications with static
routing.

The DNP offers a configurable number L, N and M of ports:
respectively, intra-tile I/O ports that connect elements within the
same computational tile, on-chip inter-tile ports that link
different tiles on the same silicon die, and off-chip inter-tile
I/O ports that link tiles belonging to different dies. Because of
the fully switched architecture, the DNP may sustain up to
L + N + M packet transactions at the same time.

The DNP has been integrated into the design of an MPSoC dedicated
to both high-performance audio/video processing and theoretical
physics applications. We present the details of its architecture
and show some promising results obtained on a preliminary
implementation.

I. Introduction

In the '70s, the ever-rising complexity of the numerical
applications employed by the scientific community started for the
first time to exceed the computing power offered by any
single-processor computer platform. The need for higher
computational capability, together with both the scaling of silicon
technology and power constraints, bolstered investigation in the
field of High Performance Computing (HPC). From the beginning, the
main issues related to HPC were scalability and robustness, in
particular for off-chip and off-board communications.

A similar scenario is now unfolding in the embedded world. 
The semiconductor industries, due to limitations similar to 
those the HPC community faced in the past, are now all shift- 
ing their interests from a monolithic processor design approach 
to Multi Processor System on Chip (MPSoC) 0. In this 
case, the challenge is to provide an efficient communication
infrastructure where traditional backbone bus systems fail to
guarantee full scalability and enough bandwidth between several
tens of on-chip processing units (micro-controllers, DSPs,
accelerators, etc.). Nowadays we are observing a convergence
between embedded and HPC systems, as the single HPC computational
node is more and more often an MPSoC.

In this scenario we propose a novel embeddable and fully 
synthesizable hardware block, the Distributed Network Pro- 
cessor (DNP), that provides a fully scalable and configurable 
network infrastructure (the DNP-Net) for both on-chip and 
off-chip interconnection, suitable for platforms scaling from a 
single MPSoC to a massive HPC system. 

In the following, the elementary processing units connected by the
DNP are called tiles, and multiple, potentially heterogeneous tiles
can be laid out on a single chip. A single tile may be very simple
(e.g. consisting of a single µP plus the DNP) or quite complex
(e.g. a µP, one or more DSPs plus the DNP), but every one of them
contains a DNP unit. The hierarchy may be further deepened:
multiple multi-tile chips may be assembled on a processing board,
and multiple processing boards plugged in a rack and wired together
to build a high-performance parallel HPC system. The DNP is
connected to the other devices inside the tile via a set of
so-called intra-tile interfaces, while inter-tile interfaces are
employed to connect a tile to the other ones, which may (on-chip)
or may not (off-chip) reside on the same chip.
A distinguishing feature of our design is that the same set of
RDMA-like communication primitives can be uniformly employed to
address both on-chip and off-chip tiles. Traditionally,
RDMA-capable interconnects (e.g. Infiniband, Myrinet) are quite
expensive in terms of hardware resources, needing hundreds of
megabytes of memory and high-speed offloading processors to
properly handle the complexities of the layered protocol stack or
just the virtual memory translations of standard operating systems
(Linux, Windows) running on commodity processors (x86, PowerPC,
etc.). On the one hand, typical HPC parallel numerical applications
need full control of the computing devices, badly tolerating
multi-user environments while being really allergic to
swapping/paging regimes, i.e. they use no more than 99% of physical
system memory. On the other hand, the data processing and real-time
part of embedded MPSoC applications need locked physical buffers
used as targets of DMA transfers to/from data acquisition devices,
while the virtual memory support is really used only for the
traditional part of the application, i.e. the user interface,
configuration logic, etc. That is why, most of the time, the memory
address translation features (e.g. virtual-to-physical mapping
units or MMUs) of commercial processors, even if available, can be
turned off, as is the case on most embedded operating systems
(vxWorks, uCLinux) or even on the IBM BlueGene/L [4]. Having this
in mind, the DNP has been optimized for the case with no memory
address translation, making it comparatively small in size and
frugal in silicon resources, opening the possibility to put
multiple DNP instances onto the same chip. As a consequence, the
RDMA primitives can be promoted from a low-level API on top of
which to implement the Message Passing Interface (MPI), as is the
case on clusters with Infiniband, to a full-fledged system-wide
communication API, uniformly targeting both on-chip and off-chip
devices. In other words, the same RDMA API can be used throughout
the full hierarchy of devices.

The paper skips over all DNP software-related topics (communication
libraries, simulators, etc.), which can be found
elsewhere 0. It is organized into four sections. In the first 
one we give an overview of the DNP architecture, describing 
its modular structure and the available interfaces. In the second 
section we report on the use of the DNP in the framework of 
the SHAPES project, which was actually the motivation for its 
development in the first place. In the next one we collect some 
results of our work and finally we write down our conclusions. 

II. DNP Architecture 

The DNP [7] acts as an off-loading network engine for the tile,
performing both on-chip and off-chip transfers as well as
intra-tile data moving. At the inter-tile level it operates as a
network adapter, while at the intra-tile level it can be used as a
basic DMA controller useful for moving large data chunks, e.g.
between external DRAM and internal static RAM.

The DNP has been developed as a parametric Intellectual Property
library, meaning that some fundamental architectural features can
be chosen among the available ones, others can be customized at
design-time and new ones can be developed. A highly modular design
has been employed, separating the architectural core (DNP CORE)
from the interface blocks (INTRAT IF and INTERT IF), as shown in
Fig. 1. The Core and the Interfaces are connected by a customizable
number of ports.
[Fig. 1 diagram labels: INTERT IF with M off-chip ports and N
on-chip ports; DNP CORE with an (L+M+N) x (L+M+N) switch; L master
ports.]

Fig. 1. A block diagram of the DNP architecture: the modular design
separates the architectural core (DNP CORE) from the interface blocks
(INTRAT IF and INTERT IF). The inter-tile interfaces can be either
off-chip or on-chip, and in either case they can be equipped with
virtual channels. The intra-tile slave interface is basically used by
the software to configure and program the DNP. The intra-tile master
interfaces are controlled by the DNP and are used to stream data in
and out.

This flexibility allows tailoring the DNP architecture to the
particular use case and choosing the topology of the DNP-Net: one
only needs to define the number M of inter-tile off-chip ports, the
number N of inter-tile on-chip ports and the routing algorithm to
automatically achieve the desired configuration.
For example, in the SHAPES project [8], as we will show in section
III, our IP library has been used to implement a 3D torus network
topology in a multi-tile system architecture. In addition, a
suitable number L of intra-tile master ports can be chosen to match
the required network I/O bandwidth.

The DNP architecture is a crossbar switch with configurable routing
capabilities operating on packets with variable-sized payload. The
implementation of virtual channels [9] on incoming switch ports
guarantees deadlock avoidance.

The DNP gathers data coming from the intra-tile ports, fragmenting
the data stream into packets (Fig. 4), which are forwarded to the
destination via the intra-tile or inter-tile ports depending on the
requested operation.

A. RDMA architecture and communication robustness 

The DNP RDMA features may be condensed into three components:

• A Command Queue, implemented as a hardware FIFO 
queue (CMD FIFO), where the software pushes RDMA 
commands to be asynchronously executed by the DNP. 
A DNP command is composed of seven words containing the
information necessary to perform the required data
transport operation.

• A Completion Queue (CQ), which lives in the tile mem- 
ory and is treated as a ring buffer, where the DNP writes 
events, which are simple data structures, and software 
reads them. Events are generated as commands are exe- 
cuted and incoming packets are processed. 

































Fig. 2. Examples of on-chip and off-chip network topologies and services 
offered by reconfiguring the parametric DNP. 

[Fig. 4 diagram labels: field sizes of 2 words, 3 words, (max 256
words) and 1 word; annotations: HOPs number, ERROR flags.]






Fig. 3. Data transfer example: an RDMA GET operation characterized by
different initiator (INIT DNP), source (SRC DNP) and destination
(DST DNP). The most common use is with INIT == DST.

• A Look-up Table (LUT), a hardware memory block
embedded in the DNP which is accessible by software
through an intra-tile interface.

The supported RDMA commands are described by a compact set of
parameters: the command code (LOOPBACK, PUT, SEND and GET), the
source memory address and DNP (SRC DNP), the destination memory
address and DNP (DST DNP), and the length in words.
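
The command parameters and the two queues described above can be
pictured with a minimal Python sketch. It is only an illustration:
the class and field names (RdmaCommand, CommandFifo, CompletionQueue)
are hypothetical, the actual seven-word command encoding is not given
in this paper, and the behaviour on queue overrun is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Opcode(Enum):
    LOOPBACK = 0
    PUT = 1
    SEND = 2
    GET = 3


@dataclass(frozen=True)
class RdmaCommand:
    """Logical content of a command (hypothetical field names)."""
    opcode: Opcode
    src_dnp: int
    src_addr: int
    dst_dnp: int
    dst_addr: int
    length_words: int


class CommandFifo:
    """Software pushes commands; the DNP engine pops them later."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self._queue = deque()

    def push(self, cmd: RdmaCommand) -> bool:
        # Assumption: a full FIFO rejects the command.
        if len(self._queue) >= self.depth:
            return False
        self._queue.append(cmd)
        return True

    def pop(self):
        return self._queue.popleft() if self._queue else None


class CompletionQueue:
    """Ring buffer in tile memory: the DNP writes, software reads."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.wr = 0  # advanced by the DNP
        self.rd = 0  # advanced by the software

    def write_event(self, event) -> None:
        # Assumption: no overrun protection is modelled here.
        self.slots[self.wr % len(self.slots)] = event
        self.wr += 1

    def read_event(self):
        if self.rd == self.wr:
            return None  # nothing new to read
        event = self.slots[self.rd % len(self.slots)]
        self.rd += 1
        return event
```

In this sketch the software would push a command, continue with other
work, and later poll read_event() to learn that the command has been
executed.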

The LOOPBACK command describes a memory move operation in which the
content of a (local) memory area is copied into another one. One
intra-tile interface is used to fetch data and another one to write
it to the final destination. The RDMA PUT and SEND commands
generate a one-way network transaction; the DNP executes them by
reading the local buffer and producing a packet stream targeted to
the destination DNP. The RDMA GET command is a two-way transaction
in which the initiator DNP (INIT DNP) generates a request packet
destined to the source DNP, which in turn generates a data packet
stream toward the destination DNP (see Fig. 3 for the most general,
three-actor situation). Most of the time the destination DNP is the
same as the initiator, so it collapses to the usual RDMA GET
operation as found in other networks. After execution of each
command, the DNP optionally writes an event in the CQ, so that the
software may acknowledge the command as executed, freeing the
source data buffer for further use.

Fig. 4. DNP packet overview: the packet is made of a fixed-size
header and footer and a variable number of payload words. The payload
can be up to 256 words. There is an optional space for an integrity
check code (CRC) and corrupted packets are flagged by a single bit in
the footer.

The buffers which are used as a source in an RDMA command need no
special care, while those which are destinations have to be
pre-registered into the LUT by the software. The LUT is organized
in records, each one containing the buffer physical start address,
length and some flags. When a packet is received, the LUT is
scanned in search of an entry matching the packet destination
buffer; only if one is found is the operation carried out. After
writing (PUT and SEND) or reading (GET) the data, a corresponding
event is written in the CQ, so that the software may carry on
further operations, e.g. deregistering the buffer, sending back
some acknowledge packets, etc. SEND packets are similar to PUT ones
with a null destination address, so that the first suitable buffer
in the LUT is picked up and used as the target buffer. They are
vital for the software application to bootstrap the RDMA protocol
at the beginning (e.g. after buffer allocation, to communicate the
buffer addresses to the other DNPs), in the eager communication
protocol, and more in general to exchange small amounts of data,
e.g. protocol packets. On the other hand, PUT packets are mostly
intended for the rendezvous protocol.
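
The LUT lookup just described can be sketched as follows. This is a
sketch under stated assumptions, not the hardware behaviour: the
matching rule (the incoming data must fit entirely inside one
registered buffer), the meaning of "suitable" for SEND (the first
buffer large enough) and all names are hypothetical, and addresses are
counted in words for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LutEntry:
    start: int          # buffer start address (in words, assumption)
    length_words: int   # buffer length in words


class Lut:
    def __init__(self) -> None:
        self.entries: List[LutEntry] = []

    def register(self, start: int, length_words: int) -> LutEntry:
        """Software pre-registers a destination buffer."""
        entry = LutEntry(start, length_words)
        self.entries.append(entry)
        return entry

    def match_put(self, dst_addr: int, length_words: int) -> Optional[LutEntry]:
        """PUT: find the entry that fully contains the target range."""
        end = dst_addr + length_words
        for entry in self.entries:
            if entry.start <= dst_addr and end <= entry.start + entry.length_words:
                return entry
        return None  # no match: the operation is not carried out

    def match_send(self, length_words: int) -> Optional[LutEntry]:
        """SEND: null destination address, pick the first suitable buffer."""
        for entry in self.entries:
            if entry.length_words >= length_words:
                return entry
        return None
```

A SEND therefore needs no knowledge of remote addresses, which is why
it can be used to exchange buffer addresses before any PUT is issued.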

B. DNP packets 

A data sending operation may generate one or more outgoing packets.
The DNP hosts a hardware fragmenter block which automatically cuts
a stream of data words into a stream of multiple packets. A packet
is made up of a fixed-size envelope plus a payload of a variable
number of words (Fig. 4). In the current implementation, the header
is split into a network header (NET HDR), which basically carries
routing information, and an RDMA header (RDMA HDR), which is
processed only by the destination DNP; the footer hosts the packet
integrity code and the corruption flag.
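
As an illustration of the fragmentation step, the short Python sketch
below cuts a list of data words into payloads of at most 256 words,
the maximum payload size quoted for the DNP packet. The function name
is hypothetical and the envelope (header and footer) is deliberately
left out.

```python
MAX_PAYLOAD_WORDS = 256  # maximum payload size stated for a DNP packet


def fragment(words, max_payload=MAX_PAYLOAD_WORDS):
    """Split a sequence of data words into per-packet payloads."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [words[i:i + max_payload] for i in range(0, len(words), max_payload)]
```

For example, a 600-word transfer would be carried by three packets
with payloads of 256, 256 and 88 words.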

Every DNP is uniquely addressed by an 18-bit string, whose
interpretation depends on the exact details of the network
topology; address decoding is done in the router module and must be
customized accordingly. For instance, in a 3D torus network those
bits can be evenly split into an (x, y, z) triplet, while in a
NoC-based design there could be an additional internal coordinate,
i.e. a 4-tuple like (x, y, z, w).
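
The even split for the 3D torus case can be sketched as below, with 6
bits per coordinate. The bit ordering (x in the most significant bits)
and the function names are assumptions made for illustration; the
paper only states that the 18 bits can be split evenly.

```python
ADDR_BITS = 18
COORD_BITS = ADDR_BITS // 3      # 6 bits per coordinate
COORD_MASK = (1 << COORD_BITS) - 1


def encode_xyz(x: int, y: int, z: int) -> int:
    """Pack (x, y, z) into an 18-bit DNP address (assumed bit order)."""
    for c in (x, y, z):
        if not 0 <= c <= COORD_MASK:
            raise ValueError("coordinate out of range")
    return (x << (2 * COORD_BITS)) | (y << COORD_BITS) | z


def decode_xyz(addr: int):
    """Unpack an 18-bit DNP address into its (x, y, z) triplet."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 18 bits")
    x = (addr >> (2 * COORD_BITS)) & COORD_MASK
    y = (addr >> COORD_BITS) & COORD_MASK
    z = addr & COORD_MASK
    return x, y, z
```

With this split each torus dimension can hold up to 64 DNPs.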

C. Reliability guarantees 

The DNP architecture rests on some basic assumptions:

• Neither the destination DNP nor the in-between transient 
DNPs are allowed to drop packets. 

• Each packet is always reliably transferred to its destination.

• Packet corruption is only allowed in the payload and has 
to be detected and marked in the footer, so that it can be 
handled by the application. 

• There is no low level protocol to signal packet delivery, 
e.g. by n/ack packets. If necessary, it can be implemented 
by the application. 

These assumptions turn into requirements for the hardware
architecture; it has to be robust enough to protect at least the
packet envelope (header and footer), e.g. to avoid bad routing due
to corrupted headers. The on-chip bit error rate (BER) is assumed
negligible, at least at the single-tile level with mature silicon
processes. Most of the current on-chip inter-tile communication
technologies (NoC) claim to control the BER on their own. So the
inter-tile off-chip interfaces are the ones most affected by the
above requirements: they have to employ suitable flow-control
techniques to avoid packet dropping, and some form of redundancy
has to be employed to protect the packet envelope from corruption,
e.g. using checking and retransmission techniques.

D. Core 

The modular approach used to design the DNP architecture is also
applied to the DNP core (Fig. 1); therefore its main functionality
is split into several separate modules.

Devices inside the tile request DNP primitives by issuing commands
via the intra-tile slave interface (INTRA-TILE SLAVE). The Engine
(ENG) fetches commands from the CMD FIFO and uses them to fill out
the packet header. The payload data are read by an intra-tile
transaction using information in the RDMA Controller block (RDMA
Ctrl), and the newly created packets are forwarded through the
Switch port towards the proper inter-tile ports (PORT_VCH or PORT).

The routing logic (RTR) configures the SWITCH paths be- 
tween the DNP ports, sustaining up to L+M+N simultaneous 
packet transactions. If more than one packet requires the same 
port, the arbiter block (ARB) applies the arbitration policy to 
solve the contention. The parametrized number L of intra-tile 
master ports (INTRA-TILE MST) guarantees the connection 
among elements within the same tile. 

On receiving a packet, an intra-tile transaction is carried out
with information from the RDMA Ctrl block, which wraps the LUT
inside. Each RDMA transaction is followed by a completion operation
(see sect. II-A).

[Fig. 5 diagram labels: RDT Tile; ports X+, X-, Y+, Y-, Z+, Z-; AHB
Multilayer; M and S interface boxes; (e.g. debug, JTAG).]

Fig. 5. The SHAPES RISC+DSP+DNP Tile (RDT) is an example of
integration of the DNP in a commercial environment (Atmel Diopsis 940)
to provide networking capabilities. In this drawing, the M and S boxes
are respectively master and slave interfaces. DPM and DDM are the DSP
Program and Data memories. POT is the set of peripherals. DXM is the
external data memory controller.

Besides the CMD FIFO, both a set of registers (REG) and the RDMA
Look-Up Table (LUT) are accessible through the intra-tile slave
port. The registers are used to expose status information and to
configure the DNP functionality; e.g. hand-shake protocols among
blocks are often time-out based with exception raising, so that
time-out thresholds, as well as the choice of arbitration logic and
the port priority scheme, are configurable this way. Moreover, some
registers allow software to reset and disable/enable blocks inside
the DNP at run time.

E. Intra-tile and inter-tile Interfaces 

As previously described, modularity characterizes the DNP design.
The DNP may be adapted to different environments with minor
rewriting efforts, since the interfaces are completely independent
from the core. Different network topologies lead to different
protocol choices as well as different error checking and correction
requirements. Keeping some of these features out of the Core module
makes it smaller and cleaner, moving the potential complexity to
the inter-tile or intra-tile interfaces.

At the intra-tile level the DNP exposes a proprietary bus protocol,
while at the inter-tile ports a FIFO-like signaling is used.

The intra-tile interfaces are in charge of translating the DNP
transactions into the particular protocol used inside the tile.
These interfaces are specific to the different bus architectures,
e.g. AMBA-AHB, AMBA-AXI, PCI, PowerPC, etc. In our library we
provide AMBA-AHB adaptors for intra-tile communication, for both
the master (MASTER) and the slave (SLAVE) interfaces. On the other
hand, PCI/PCI-X/PCI-e commercial cores can be easily used by
developing customized wrapper blocks. The implemented INTRA-TILE
SLAVE adapter maps the registers, the LUT and the command FIFO.

Fig. 6. 8-RDTs SHAPES chip arranged in a 3D lattice. In each chip the 
RDT tiles are connected by the ST-Spidergon NoC. 

The inter-tile interfaces are responsible for on-chip and off-chip
interconnection. While the DNP can be connected to a highly
specialized on-chip network infrastructure (e.g. a NoC), it also
allows for point-to-point interconnection with simple interface
blocks.

A very wide range of solutions may be applied for off-chip
communication depending on the requirements. In our case study (see
sect. III), the on-chip communication is NoC based, while for
off-chip communication we provide a bidirectional
Serializer/Deserializer (Ser/Des) with error check, DC-balance and
re-transmission capability. The NoC used is the ST-Spidergon, while
the Ser/Des block is a proprietary solution.

III. A Case Study: the SHAPES computing 


The DNP was developed in the context of the SHAPES 
European project, proposing a scalable HW/SW design style 
for future CMOS technologies. 

Besides the DNP, a typical SHAPES tile contains a VLIW
floating-point DSP (Atmel mAgicV) with its on-chip Distributed
Data/Program Memory (DDM-DPM), a RISC processor (ARM9), an
interface to the Distributed External Memory (DXM) and a set of
peripherals called the Peripherals on Tile (POT), e.g. JTAG,
Ethernet, etc.; in SHAPES jargon, this tile configuration is called
RISC+DSP+DNP Tile (RDT) (Fig. 5).

The SHAPES network connects on-chip and off-chip tiles, 
weaving a distributed packet-switching network. 

The SHAPES software tool-chain provides a simple and 
efficient programming environment for tiled architectures. 
SHAPES proposes a layered system software [8], which
dramatically helps the programmer to harness the intrinsic 
complexity of fully scalable multi-processor systems. 

A. The DNP in the SHAPES architecture 

In the SHAPES framework the intra-tile interfaces are AMBA-AHB
standard ones, connected to the DSP and RISC through an AMBA-AHB
multilayer bus (Fig. 5). For on-chip communications, the SHAPES
architecture avails itself of the ST-Spidergon Network-on-Chip [10]
[11] [12], to which the DNP is connected by a specific interface,
called the DNP Network-on-Chip Interface (DNI). The 3D torus
topology has been adopted for off-chip networking (Fig. 6), with
bidirectional links connecting all nodes, which requires a total of
six inter-tile interfaces per DNP. This configuration of ports
gives the SHAPES tile a DNP instance with L = 2, M = 6 and N = 1.

The DNP applies a deterministic routing policy to implement
communications on the 3D torus network. The coordinate evaluation
order (e.g. first Z is consumed, then Y and eventually X) can be
chosen at run-time by writing into a specialized priority register;
in this way, the routing scheme is configurable to a certain
extent.
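
A deterministic, dimension-ordered routing step with a configurable
coordinate order can be sketched as follows. This is a generic
illustration and not the DNP router logic: the function name, the
port labels and the tie-breaking rule (take the shorter way around
the ring, preferring the positive direction on a tie) are assumptions.

```python
AXES = {"x": 0, "y": 1, "z": 2}


def next_hop(cur, dst, dims, order=("z", "y", "x")):
    """Return the output port label (e.g. 'z+') or None if arrived.

    cur, dst: (x, y, z) coordinates; dims: torus size per dimension;
    order: the coordinate evaluation order, standing in for the
    priority register mentioned in the text.
    """
    for name in order:
        axis = AXES[name]
        if cur[axis] == dst[axis]:
            continue  # this coordinate is already consumed
        size = dims[axis]
        forward = (dst[axis] - cur[axis]) % size   # hops going '+'
        backward = size - forward                  # hops going '-'
        return name + ("+" if forward <= backward else "-")
    return None
```

Because every DNP applies the same fixed order, the path between two
nodes is fully determined by their coordinates.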

1) The On-Chip Interface: the DNI is the on-chip bidirectional
interface handling DNP transmissions to/from the ST-Spidergon NoC.
The communication protocol employed is a hand-shake protocol based
on a request/grant policy. This interface includes a sub-module
that verifies data by means of a Cyclic Redundancy Check (CRC).
During the packet delivery process a CRC is computed and
transmitted together with the footer. On reception, the CRC is
recalculated and checked; in case of transmission errors a bit in
the footer is set and the packet goes on its way. The software can
detect packet corruptions by analysing the footer.
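
The compute, transmit and re-check cycle can be illustrated with a
bitwise CRC-16 in Python. The text names the CRC-16 family but gives
no polynomial or parameters, so this sketch assumes the common
CRC-16/ARC variant (polynomial 0x8005 processed in reflected form as
0xA001, zero initial value); the function names are hypothetical.

```python
def crc16_arc(data: bytes) -> int:
    """Bitwise CRC-16/ARC (assumed variant, reflected poly 0xA001)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def is_corrupted(received: bytes, crc_from_sender: int) -> bool:
    """Receiver side: recompute the CRC and compare.

    A True result corresponds to setting the corruption bit in the
    footer; the packet is still delivered.
    """
    return crc16_arc(received) != crc_from_sender
```

The standard check string "123456789" gives 0xBB3D with these
parameters, which is a quick way to validate the routine.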

The ST-Spidergon NoC implements deadlock avoidance on its own,
therefore no virtual channels are necessary on the DNP port side.

2) The Off-Chip Interface: the inter-tile off-chip interface has a
parallel clock SerDes architecture, employing Double Data Rate
signaling in order to reduce packet transmission latency. Special
encoding and a DC-balance block guarantee the quality of the
transmission line. The balancing is performed by inverting the
transmitted word to equalize the number of 1 and 0 bits over time.
Furthermore, the inter-tile off-chip interface implements the
mesochronous clocking technique in order to handle the clock-phase
skew between communicating DNPs.

It manages the data flow by encapsulating the DNP packets into a
light, low-level protocol able to detect transmission errors via
CRC, and includes a memory buffer to re-transmit the header and the
footer in case of transmission errors. Therefore the protocol
assures the delivery of the packet, avoiding non-recoverable
situations where badly corrupted packets (with errors in the header
or footer) pose a threat to the global routing. In case of packets
with payload errors (signaled by the footer), the software
communication library is in charge of handling the situation.

The chosen CRC polynomial generator is the same used for the
inter-tile on-chip interface: it is the industry-standard,
well-known CRC-16. Past experience collected on a custom-developed
massively parallel computing architecture [13] [14] has given us
insight into the bit error rate of a typical off-chip LVDS link,
while proving the CRC-16 to be a satisfactory choice.

(a) On-chip communication ST-NoC based (MTNoC) (b) On-chip communication 2D-DNP based (MT2D) 

Fig. 7. Details of the explored solutions: MTNoC and MT2D 

B. Exploring the DNP high configurability 

Within the SHAPES project, thanks to the high level of 
parametrization offered by the DNP, we were able to propose 
different solutions for the inter-tile on-chip network. Besides 
the ST-Spidergon NoC (MTNoC), an alternative solution has 
been proposed that involves an on-chip 2D mesh network, 
obtained using the DNP inter-tile ports connected point-to- 
point (MT2D). 

This experiment provided two solutions suitable for possibly
different application requirements (Fig. 7).

IV. Results 

The DNP architecture has been validated in the SHAPES project by
using the DNP TLM SystemC [15] model integrated in different
multi-tile processor configurations. In particular, the DNP was
employed in benchmarking the SHAPES architecture on a kernel code
[16] for Lattice Quantum Chromo Dynamics (LQCD), and tested on a
system configuration of 8 RDTs arranged in a 2 x 2 x 2 3D topology
(one of those depicted in Fig. 7).

At the same time, an RTL model (written in VHDL) has been used to
simulate and validate the DNP implementation, with both ad-hoc
stimuli test vectors and mixed SystemC-VHDL simulations. Along with
preliminary measurements of intra-tile and inter-tile bandwidth and
latency, we estimated area consumption and power dissipation on an
early place-and-route layout. In the following, bandwidth figures
are meant as unidirectional, i.e. for each direction, unless
explicitly specified.

Depending on the particular intra-tile interface adopted in the
target design (e.g. matching a bus with reduced width or with lower
frequency), the DNP intra-tile port is able to sustain up to
1 word/cycle (1 word equals 32 bits), so the resulting intra-tile
bandwidth is $BW_{int} = L \times 32 = 64$ bit/cycle, roughly
4 GB/s at 500 MHz, or 4 + 4 GB/s bidirectional bandwidth. The
latency for intra-tile communications, i.e. the time interval
elapsed from the LOOPBACK command reaching the CMD FIFO to the
first outgoing data surfacing onto the bus, as in Fig. 8, is
$L_{int} = L_1 + L_2 \approx 100$ cycles, equal to 200 ns at the
target frequency.

[Fig. 8 diagram label: DNP (init=src=dst).]

Fig. 8. Details on the timing measurement solution adopted for a
LOOPBACK operation. $L_1$ represents the time elapsed between the
command issuing and the beginning of the read intra-tile transaction,
while $L_2$ is the time necessary to complete the LOOPBACK operation
and to begin the write intra-tile operation.
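
The intra-tile figures quoted above follow from simple arithmetic,
reproduced in the Python sketch below with L = 2 master ports, 32-bit
words and a 500 MHz clock. The helper names are hypothetical; the
input values are the ones given in the text.

```python
WORD_BITS = 32


def bandwidth_gbyte_per_s(ports: int, bits_per_cycle_per_port: int, freq_mhz: float) -> float:
    """Aggregate bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bits_per_cycle = ports * bits_per_cycle_per_port
    return bits_per_cycle * freq_mhz / 8 / 1000


def cycles_to_ns(cycles: int, freq_mhz: float) -> float:
    """Convert a cycle count into nanoseconds at the given clock."""
    return cycles * 1000 / freq_mhz


intra_tile_bw = bandwidth_gbyte_per_s(ports=2, bits_per_cycle_per_port=WORD_BITS, freq_mhz=500)
loopback_latency_ns = cycles_to_ns(100, freq_mhz=500)
```

Here intra_tile_bw evaluates to 4.0 (GB/s per direction) and
loopback_latency_ns to 200.0.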

Inter-tile on-chip ports are designed to be connected by
point-to-point parallel links and support a bandwidth of
1 word/cycle as well. Because of the DNP distributed network
architecture, it is possible to arrange tiles inside the chip so as
to minimize the wire lengths and the power dissipated by data
transmission. In this context a parallel on-chip link is still a
valid choice considering the trade-off between power dissipation
and bandwidth, as the design of many NoCs shows.
While in theory inter-tile off-chip ports may also sustain up to
1 word/cycle of network load, the actual off-chip channel bandwidth
is inherently dependent on the serialization factor adopted on the
link, defined as the ratio between the DNP internal data width and
the number of serial lines. In the SHAPES implementation we chose a
serialization factor equal to 16, in order to minimize the number
of double data rate physical wires (and the corresponding
multi-tile processor chip pin count), to allow for high
transmission frequencies and distances of the order of some meters,
and to simplify the design of the processing board PCB. The above
choice leads to an off-chip network bandwidth equal to 4 bit/cycle.
Taking into account all the factors, the inter-tile on-chip and
off-chip bandwidths are $BW_{\text{on-chip}} = N \times 32$
bit/cycle and $BW_{\text{off-chip}} = M \times 4$ bit/cycle per
direction.

Fig. 9. Single-hop PUT command with 1 word payload. $L_1$ represents
the time elapsed between the command issuing and the beginning of the
read intra-tile transaction; $L_2$ is the time necessary to transmit
the first header word of the packet to the appropriate inter-tile
interface across the switch; $L_3$ is the transmission time over the
serialized off-chip interface toward the DST TILE; $L_4$ is the time
down to the intra-tile write operation.

Fig. 10. The broken-out timings for the PUT command in Fig. 9 are
graphed.

Fig. 11. Double-hop timings for a PUT command. $L_5$ is the time
elapsed for the hop across an additional DNP, containing the
transmission time over the second serialized off-chip interface; it
partially overlaps the $L_3$ for the first hop.

Place&Route trials for both MTNoC and MT2D DNP

                       MTNoC        MT2D
on-chip ports (N)      1            3
off-chip ports (M)     6
estimated area         1.30 mm^2    1.76 mm^2
estimated power        160 mW       180 mW
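
The link between the serialization factor and the off-chip bandwidth
can be checked with a short Python sketch. It rests on an inference
rather than on an explicit statement in the text: a 32-bit internal
width with a serialization factor of 16 gives 2 serial lines, and
double data rate signaling moves 2 bits per line per cycle, which
reproduces the quoted 4 bit/cycle. Function names are hypothetical.

```python
def off_chip_bits_per_cycle(data_width_bits: int, serialization_factor: int, ddr: bool = True) -> int:
    """Per-link, per-direction bits per cycle (inferred model)."""
    lines, remainder = divmod(data_width_bits, serialization_factor)
    if remainder or lines == 0:
        raise ValueError("serialization factor must divide the data width")
    return lines * (2 if ddr else 1)


def aggregate_bits_per_cycle(ports: int, bits_per_cycle_per_port: int) -> int:
    """BW = ports x bits/cycle, as in the two formulas above."""
    return ports * bits_per_cycle_per_port


per_link = off_chip_bits_per_cycle(32, 16)           # SHAPES choice
bw_off_chip = aggregate_bits_per_cycle(6, per_link)  # M = 6
bw_on_chip = aggregate_bits_per_cycle(1, 32)         # N = 1
```

Under the same model, a serialization factor of 8 would double the
per-link figure to 8 bit/cycle.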

The associated single-hop latencies (Figs. 9 and 10), defined as
the time interval elapsed from the writing of the RDMA command on
the sender DNP CMD FIFO until the first word of the packet header
is written on the destination DNP intra-tile interface, are
$L_{\text{on-chip}} = L_1 + L_2 + L_4 \approx 130$ and
$L_{\text{off-chip}} = L_1 + L_2 + L_3 + L_4 \approx 250$ cycles,
respectively 260 ns and 500 ns at 500 MHz. The cost in latency of
an additional hop (Fig. 11) over an off-chip interface, for example
to reach the next-neighbor DNP, is 100 cycles, which is less than
the naive guess of $L_2 + L_3 \approx 150$ cycles thanks to
wormhole routing. Incidentally, the relatively high value of
$L_{\text{off-chip}}$ is influenced by the latency introduced by
serialization and is dependent on the serialization factor.
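
The latency figures above can be combined in a small Python sketch.
The measured values (130, 250, 100 and 150 cycles) come from the
text; the derived serialization term and the linear extrapolation to
more than two hops are assumptions made here for illustration.

```python
L_ON_CHIP = 130        # cycles, single hop, on-chip
L_OFF_CHIP = 250       # cycles, single hop, off-chip
EXTRA_HOP = 100        # cycles, measured cost of one more off-chip hop
NAIVE_EXTRA_HOP = 150  # cycles, estimate without wormhole overlap


def l3_cycles() -> int:
    """Off-chip link term: the two sums differ only by L_3."""
    return L_OFF_CHIP - L_ON_CHIP


def off_chip_latency_cycles(hops: int) -> int:
    """Assumed linear model: each further hop adds EXTRA_HOP cycles."""
    if hops < 1:
        raise ValueError("at least one hop is required")
    return L_OFF_CHIP + EXTRA_HOP * (hops - 1)


def cycles_to_ns(cycles: int, freq_mhz: float = 500) -> float:
    return cycles * 1000 / freq_mhz
```

With these numbers the link term is 120 cycles, a double-hop PUT
takes 350 cycles (700 ns at 500 MHz), and wormhole routing saves 50
cycles per additional hop with respect to the naive estimate.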

We performed the Place&Route trial of both the MTNoC and MT2D
inter-tile interconnection infrastructures, introduced in sec.
III-B, using a 45 nm silicon process at 500 MHz. The current DNP
area estimates for the MTNoC and the MT2D architectures are
respectively 1.30 mm^2 and 1.76 mm^2; the larger area for the
latter is mainly due to the higher number of on-chip ports (3 ports
vs. 1 port), implying a more complex switch matrix architecture and
a larger number of DNP data buffers, a complexity which is moved
into the NoC block in the MTNoC case.

Furthermore, it must be taken into account that these estimates
were obtained from Place&Route trials where the DNP data buffers
were synthesized using registers in place of memory macros, which
are currently not available in the silicon process library. We
expect to halve this area in the final design.

The estimated power consumption of a single 
DNP in the RDT computational tile is 160 mW in the MTNoC 
case and 180 mW in the MT2D case, with the same silicon 
process and frequency, amounting to about 1/4 of the tile 
dissipation figure. 

A motherboard built from 32 multi-tile processor chips (where 
one chip includes 8 RDTs) would provide 1 Tera-Flops of 
floating point computing power with roughly 600 W of peak 
power consumption. 
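
The per-tile figures implied by these estimates follow from simple arithmetic. The sketch below is only a back-of-the-envelope check: the names are hypothetical, and the tile power is inferred from the statement that the DNP accounts for about 1/4 of the tile dissipation, so it is an approximation and not a measured value.

```python
# Back-of-the-envelope check of the board-level estimates quoted in the text.
CHIPS_PER_BOARD = 32
RDTS_PER_CHIP = 8
BOARD_PEAK_FLOPS = 1e12                           # 1 Tera-Flops per motherboard
DNP_POWER_W = {"MTNoC": 0.160, "MT2D": 0.180}     # estimated DNP power
DNP_SHARE_OF_TILE = 0.25                          # "about 1/4", approximate


def tiles_per_board():
    """Number of RDT tiles on one motherboard."""
    return CHIPS_PER_BOARD * RDTS_PER_CHIP


def gflops_per_tile():
    """Peak floating point performance of one tile, in GFlops."""
    return BOARD_PEAK_FLOPS / tiles_per_board() / 1e9


def implied_tile_power_w(variant):
    """Tile dissipation implied by the DNP share (an inferred estimate)."""
    return DNP_POWER_W[variant] / DNP_SHARE_OF_TILE


if __name__ == "__main__":
    print(tiles_per_board(), gflops_per_tile())
    for variant in DNP_POWER_W:
        print(variant, round(implied_tile_power_w(variant), 2), "W")
```

This gives 256 tiles per motherboard and roughly 3.9 GFlops of peak performance per tile.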

V. Conclusion and future work 

In the SHAPES project framework, we proved the paramet- 
ric and configurable architecture of the DNP to be effective 
for architectural exploration of on-chip and off-chip intercon- 
nection networks, to rapidly yield an efficient design. 

As of today, the inter-tile off-chip interfaces are imple- 
mented with a serialization factor equal to 16; there is room 
for considerable improvement in bandwidth and latency, 
by either reducing the serialization factor to 8 or increasing the 
switching frequency of the off-chip physical links. Using the 
target 45 nm silicon process technology, we expect to double 
the current switching frequency, pushing it up to 1 GHz. 
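
The effect of the two options on the off-chip bandwidth can be sketched as follows. The model is an assumption, not a result of the paper: it takes the link bandwidth to be proportional to the link switching frequency and inversely proportional to the serialization factor, and the function name is hypothetical.

```python
# Assumed model: off-chip bandwidth ~ link frequency / serialization factor.
def off_chip_bandwidth_gain(old_factor, new_factor, old_link_hz, new_link_hz):
    """Relative bandwidth gain when changing factor and/or link frequency."""
    if min(old_factor, new_factor, old_link_hz, new_link_hz) <= 0:
        raise ValueError("all arguments must be positive")
    return (old_factor / new_factor) * (new_link_hz / old_link_hz)


if __name__ == "__main__":
    # Option 1: serialization factor 16 -> 8 at unchanged link frequency.
    print(off_chip_bandwidth_gain(16, 8, 500e6, 500e6))
    # Option 2: link frequency 500 MHz -> 1 GHz at unchanged factor.
    print(off_chip_bandwidth_gain(16, 16, 500e6, 1e9))
```

Under this model either option alone doubles the off-chip bandwidth per port.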

The routing function of the network and the set of supported 
commands are currently implemented by hard-coded logic 
blocks; the option to instead have a µP in its place is currently 
under study. This replacement would be a significant step 
ahead towards a more thoroughly programmable DNP. 

Moreover, in future releases of the DNP we also plan to 
strengthen the robustness of the network infrastructure by 
integrating the minimal hardware redundancy needed to support 
well-known fault-tolerant routing methods for 
torus-based, point-to-point networks [17], [18]. 

Considering our know-how on massively parallel architec- 
tures, collected in more than 20 years of research and devel- 
opment in the field, we think that DNP can be a key feature 
to assemble HPC systems that target multi-dimensional lattice 
problems of large proportions (e.g. LQCD) [19], for which 
high demands of parallel processing and data communication 
between processing nodes are considered critical. 


Acknowledgment 

The authors would like to thank Dr. R. Ammendola, for 
useful discussions and vital activities on seminal work, and 
Prof. N. Cabibbo for being the benign founder of the research 
group in which all of us work. 


References 

[1] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An 
Engineering Approach, Morgan Kaufmann Publishers Inc., San Francisco, 
CA, 2003. 

[2] W.J. Dally and S. Lacy, VLSI Architectures: Past, Present and Future, 
Proc. Advanced Research in VLSI Conf., IEEE Press (1999) Pages 232- 

[3] M.B. Taylor et al., The Raw Microprocessor: A Computational Fabric for 
Software Circuits and General-Purpose Programs, IEEE Micro 2002. 

[4] J. Moreira et al., The Blue Gene/L Supercomputer: A Hardware and 
Software Story, International Journal of Parallel Programming, n. 3, 
Pages 181-206, June 2007. 

[5] K. Popovici, X. Guerin, F. Rousseau, P.S. Paolucci and A. A. Jerraya, 
Platform-based software design flow for heterogeneous MPSoC, 
Transactions on Embedded Computing Systems, Vol. 7, Pages 1-23, New York, 
NY, USA, 2008. 

[6] L. Thiele, I. Bacivarov, W. Haid and K. Huang, Mapping Applications 
to Tiled Multiprocessor Embedded Systems, Proc. 7th Intl Conference 
on Application of Concurrency to System Design (ACSD 2007), IEEE 
Computer Society, Pages 29-40, Bratislava, Slovak Republic, July 2007. 

[7] P.S. Paolucci, P. Vicini et al., Introduction to the Tiled HW Architecture 
of SHAPES, Proc. Design, Automation and Test in Europe (DATE'07), 
Vol. 1, Pages 77-82, Nice, France, April 2007. 

[8] P.S. Paolucci, A. A. Jerraya, R. Leupers, L. Thiele, and P. Vicini, Shapes: 
a tiled scalable software hardware architecture platform for embedded 
systems, Proceedings of the 4th international Conference on Hard- 
ware/Software Codesign and System Synthesis CODES+ISSS '06 (Seoul, 
Korea), ACM Press, Pages 167-172, 2006. 

[9] W.J. Dally and C.L. Seitz, Deadlock-Free Message Routing in Multi- 
processor Interconnection Networks, IEEE Trans. Comput. 36 (1987) 

[10] M. Coppola, M. D. Grammatikakis, R. Locatelli, G. Maruccia and 
L. Pieralisi, Design of Cost-Efficient Interconnect Processing Units: 
Spidergon STNoC, CRC Press, Taylor & Francis Group, 2008. 

[11] F. Palumbo, S. Secchi, D. Pani, L. Raffo, A Novel Non-Exclusive 
Dual-Mode Architecture for MPSoCs-Oriented Network on Chip Designs, 
Proceedings of the International Workshop on Systems, Architectures, 
Modeling, and Simulation, Samos, Greece, July 21-24, 2008, Pages 96- 
105, LNCS 5114. 

[12] F. Vitullo, N. E. L'Insalata, E. Petri, L. Fanucci, M. Casula, R. Locatelli, 
M. Coppola, Low-Complexity Link Microarchitecture for Mesochronous 
Communication in Networks on Chip, IEEE Transactions 
on Computers, Vol. 57, No. 9, September 2008. 

[13] F. Bodin et al., The apeNEXT project, Nuclear Physics B - Proceedings 
Supplements, Volumes 106-107, March 2002, Pages 173-176, ISSN 0920- 

[14] R. Ammendola et al., APENet: LQCD clusters a la APE, Nuclear 
Physics B (Proc. Suppl.) 140 (2005) 826-828 

[15] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele, Scalably 
distributed SystemC simulation for embedded applications, Proc. 
International Symposium on Industrial Embedded Systems (SIES 2008), Pages 
271-274, June 2008. 

[16] M. Lüscher, Lattice QCD: From quark confinement to asymptotic 
freedom, Annales Henri Poincaré 4: S197-S210, 2003. 

[17] Rajendra V. Boppana and Suresh Chalasani, Fault-Tolerant Wormhole 
Routing Algorithms for Mesh Networks, IEEE Transactions on 
Computers, Vol. 44 n. 7, Pages 848-864, July 1995. 

[18] Rajendra V. Boppana and Suresh Chalasani, Fault-Tolerant Communi- 
cation with Partitioned Dimension-Order Routers, IEEE Transactions 
on Parallel and Distributed Systems, Vol. 10 n. 10, Pages 1026-1039, 
October 1999. 

[19] R. Ammendola, A. Biagioni, S. De Luca, F. Lo Cicero, A. Lonardo, 
P. S. Paolucci, M. Perra, D. Rossetti, C. Sidore, F. Simula, N. Tantalo, 
L. Tosoratto, P. Vicini, Computing for Lattice QCD: New developments 
from the APE experiment, Il Nuovo Cimento B, Vol. 123 n. 6-7, Pages 
964-968 (2008).